Unraveling the English-Bengali Code-Mixing Phenomenon
Abstract
Code-mixing is a prevalent phenomenon in modern-day communication. Though several systems succeed at identifying a single language, identifying the language of each word in code-mixed text is a herculean task, more so in a social media context. This paper explores the English-Bengali code-mixing phenomenon and presents algorithms capable of identifying the language of every word with reasonable accuracy, both in specific cases and in the general case. We create and test a predictor-corrector model, develop a new code-mixed corpus from Facebook chat (made available for future research), and test and compare the efficiency of various machine learning algorithms (J48, IBk, Random Forest). The paper also seeks to remove the ambiguities in the token identification process.
Similar resources
Sentiment Analysis of Code-Mixed Indian Languages: An Overview of SAIL_Code-Mixed Shared Task @ICON-2017
Sentiment analysis is essential in many real-world applications such as stance detection, review analysis, recommendation systems, and so on. Sentiment analysis becomes more difficult when the data is noisy and collected from social media. India is a multilingual country; people use more than one language to communicate among themselves. The switching in between the languages is called code-sw...
Mainland Chinese Students’ Shifting Perceptions of Chinese-English Code-Mixing in Macao
As a former Portuguese colony, Macao is the only region in China where Cantonese, a variety of Chinese, and English, an international language, enjoy de facto official status, with Putonghua being a quasi-official language and Portuguese being another official language. Recently, with an increasing number of Mainland Chinese students crossing the border to pursue their tertiar...
Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages
Analysis of the informative content and sentiments of social media users has been attempted quite intensively in the recent past. Most of the systems are usable only for monolingual data and fail or give poor results when used on code-mixed data. To gather attention and encourage researchers to work on this crisis, we prepared gold standard Bengali-English code-mixed data with language ...
Development of a Cantonese-English code-mixing speech corpus
This paper describes the design and compilation of the CUMIX Cantonese-English code-mixing speech corpus. Code-mixing is a common phenomenon in many bilingual societies and it usually involves at least two different languages within one utterance. In Hong Kong, people usually mix English words and phrases with Cantonese in their daily conversation. Although there are many monolingual corpora of...
The Effects of Oral Code-mixing and Glossing on Iranian EFL Learners' Vocabulary Knowledge
The current study investigated the effects of oral code-mixing and glossing on L2 vocabulary learning. To this end, 60 EFL learners studying at pre-university school were given a pre-test to make sure that they did not have any prior knowledge of the target words. Based on their scores in the pre-test, 36 pre-university students were selected and divided into three groups, including two experim...